We’ve talked about tidy format – each variable a column, each observation a row, each type of observational unit a table. We’ve taken speeches at the Nobel prize and pages of oil company Sustainability Reports as our observations and can use this to do some transformations – do total word counts or specific word frequencies. But once we start really doing analysis at the level of the word, tidy rules would seem to dictate that we treat not individual texts but individual words as observations. Thus tidytext format is a table with one word per row.
It might help if we are more specific: we can use two signifiers for words - words and tokens. Words are unique, well, words as we conceive of them. Tokens are each individual instances of a word. So a sentence: “the brown fox jumped over the brown log” we will say has 8 tokens (number of total words) but 6 words (number of unique words, as “the” and “brown” appear twice).1 So tidy data format is one token per row.2
We convert character strings that we have been working with into tidy
text with the unnest_rows() function in the
tidytext package, which splits texts up by tokens (a
process appropriately called “tokenization”).
# Lets use the first speech in our Nobel corpus as an example
library(tidytext)
library(tidyverse)
nobel <- read_rds("data/nobel_cleaned.Rds")
ex <- nobel[1,]
ex %>%
unnest_tokens(output = words, input = AwardSpeech)
## # A tibble: 598 × 3
## Year Laureate words
## <dbl> <chr> <chr>
## 1 1905 Bertha von Suttner "on"
## 2 1905 Bertha von Suttner "behalf"
## 3 1905 Bertha von Suttner "of"
## 4 1905 Bertha von Suttner "the"
## 5 1905 Bertha von Suttner "nobel"
## 6 1905 Bertha von Suttner "committee"
## 7 1905 Bertha von Suttner "bj\u00f8rnstjerne"
## 8 1905 Bertha von Suttner "bj\u00f8rnson"
## 9 1905 Bertha von Suttner "introduced"
## 10 1905 Bertha von Suttner "the"
## # … with 588 more rows
## # ℹ Use `print(n = ...)` to see more rows
We now end up with a dataframe where each row is an observation with
words, year, and Nobel laureate as variables. Note also that
tidytext does cleaning - whitespace stripping, to_lower and
so on. We didn’t notice here as we’re using our dataframe that we have
already cleaned. The first thing we might notice is the number of rows
is the total word count.
Another way in which texts are frequently pre-processed for analysis
and we can now look at is to remove so-called stop words. Stop words are
words with grammatical rather than syntactic function - they make things
we say grammatical but don’t add (much) meaning. Examples include “a”,
“the”, “of” and so on. What counts as a stopword might vary on our
corpus! And what sort of analysis we’re trying to do.3 All major text
analysis packages have means for removing stop words. In ‘’tidytext’’ we
have a list of stopwords called stop_words from a couple of
different corpora (?stop_words for more info and
links).
stop_words
## # A tibble: 1,149 × 2
## word lexicon
## <chr> <chr>
## 1 a SMART
## 2 a's SMART
## 3 able SMART
## 4 about SMART
## 5 above SMART
## 6 according SMART
## 7 accordingly SMART
## 8 across SMART
## 9 actually SMART
## 10 after SMART
## # … with 1,139 more rows
## # ℹ Use `print(n = ...)` to see more rows
nobel_tidy <- nobel %>%
unnest_tokens(output = words, input = AwardSpeech) %>%
anti_join(stop_words, by = c("words" = "word")) # by= specifies which columns to use, had they been named the same thing we could have omitted it
stop_words, we’ll recall, is a tibble, so we can easily
make our own tibble of stopwords and add it. Say (for purposes of
demonstration) I think that the Peace Prize award ceremony is, of
course, going to talk about peace and war so we want to weed these words
out of our frequency counts.
my_words <- c("peace", "war")
custom_stop_words <- tibble(word = my_words, lexicon = "my_customization")
stop_words_custom <- rbind(stop_words, custom_stop_words)
tail(stop_words_custom) # view the end of the tibble, look like our words were added correctly
## # A tibble: 6 × 2
## word lexicon
## <chr> <chr>
## 1 younger onix
## 2 youngest onix
## 3 your onix
## 4 yours onix
## 5 peace my_customization
## 6 war my_customization
Now we can apply the stop_words_custom just like we did
stop_words. We won’t actually do this because we probably
want these words in the corpus!
If we are looking at word frequencies, will we want to count “represent” and “represented” as the same word? Or “war” and “wars”? Very likely. If so we can transform our text via a process called stemming, cutting down words to their stems so that different forms of these word are recognized as being the same thing. Something like this matters in English but might really matter in an more highly inflected language like Russian.
We have a couple stemmers to choose from in R, one and the best known is the Porter Stemming algorithm. Another is hunspell, based on the popular open source and multilingual spell checker. We’ll use Porter here.
To see it in action we’ll first test it against a short test set of words and then apply to our whole corpus.
library(SnowballC)
words_to_stem <- c("going", "represented", "wars", "similarity", "books")
SnowballC::wordStem(words_to_stem)
## [1] "go" "repres" "war" "similar" "book"
Stemming in action. So now applied to the entire document:
(nobel_tidy_stemmed <- nobel_tidy %>%
mutate(word_stem = wordStem(words)))
## # A tibble: 94,107 × 4
## Year Laureate words word_stem
## <dbl> <chr> <chr> <chr>
## 1 1905 Bertha von Suttner "behalf" "behalf"
## 2 1905 Bertha von Suttner "nobel" "nobel"
## 3 1905 Bertha von Suttner "committee" "committe"
## 4 1905 Bertha von Suttner "bj\u00f8rnstjerne" "bj\u00f8rnstjern"
## 5 1905 Bertha von Suttner "bj\u00f8rnson" "bj\u00f8rnson"
## 6 1905 Bertha von Suttner "introduced" "introduc"
## 7 1905 Bertha von Suttner "speaker" "speaker"
## 8 1905 Bertha von Suttner "baroness" "baro"
## 9 1905 Bertha von Suttner "bertha" "bertha"
## 10 1905 Bertha von Suttner "von" "von"
## # … with 94,097 more rows
## # ℹ Use `print(n = ...)` to see more rows
write_rds(nobel_tidy_stemmed, "data/nobel_stemmed.Rds")
Now we can find the most frequently appearing words in the corpus.
nobel_tidy %>%
count(words, sort=TRUE)
## # A tibble: 12,329 × 2
## words n
## <chr> <int>
## 1 peace 1595
## 2 war 927
## 3 world 847
## 4 prize 646
## 5 nations 642
## 6 nobel 597
## 7 international 530
## 8 people 513
## 9 time 441
## 10 human 440
## # … with 12,319 more rows
## # ℹ Use `print(n = ...)` to see more rows
Remember we got this nice and informative result only because we already removed the stopwords. OTherwise we would have been swamped with “the”, “and”, and so on.
Our knowledge of how to subset tibbles will now come in pretty handy. If we want to get the most frequent words, say, before 1945 we easily do this.
nobel_tidy %>%
filter(Year < 1945) %>%
count(words, sort=TRUE)
## # A tibble: 4,801 × 2
## words n
## <chr> <int>
## 1 peace 346
## 2 war 301
## 3 nations 193
## 4 international 157
## 5 league 145
## 6 world 110
## 7 time 81
## 8 american 72
## 9 prize 72
## 10 europe 69
## # … with 4,791 more rows
## # ℹ Use `print(n = ...)` to see more rows
And we could then compare it to post-1945 word frequency and we’d see pre-WWII prize speeches were more concentrated on Europe, the League, and nations while after that we get more talk of world, people, human, committee. That makes sense.
Let’s graph this.
nobel_tidy %>%
count(words, sort=TRUE) %>%
top_n(15) %>% # selecting to show only top 15 words
mutate(words = reorder(words,desc(n))) %>% # this will ensure that the highest frequency words appear to the left
ggplot(aes(words, n)) +
geom_col()
## Selecting by n
And with just a little bit more code we can view pre-1945 and post-1945 top frequency words at the same time.
nobel_tidy %>%
mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII")) %>% # creating columns with label "Pre-WWII" and "Post-WWII"
mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>%
group_by(Period) %>% # grouping by this column label so frequencies will be calculated within group
count(words, sort=TRUE) %>%
mutate(proportion = n / sum(n) * 1000) %>% # perhaps we'd like word frequency per 1000 words rather than raw counts?
slice_max(order_by=proportion, n = 15) %>% # selecting to show only top 15 words within each group
ggplot(aes(reorder_within(x = words, by = proportion, within = Period), proportion, fill = Period)) + # reordering is a bit tricky, see ?reorder_within()
geom_col() +
scale_x_reordered() +
coord_flip() +
facet_wrap(~Period, ncol = 2, scales = "free") +
xlab("Word")
One common visualization of word frequency is word clouds. To do this we use the package wordcloud which will work very nicely with our tidily organized data. Wordcloud2 gives color and more fancy options that you can also play with.
library(wordcloud)
library(wordcloud2)
nobel_tidy %>%
count(words, sort=TRUE) %>%
with(wordcloud(words, n, max.words = 100))
dat <- nobel_tidy %>%
count(words, sort=TRUE) %>%
mutate(word = words) %>%
mutate(freq = n) %>%
select(word, freq) %>%
top_n(200)
## Selecting by freq
wordcloud2(dat, size = 2)
It’s also possible to do word clouds that compare two documents. To do this we’ll need to step outside the tidyverse and organize our data in other formats so we save this for Session 5.
As we’ve seen, there are multiple ways to calculate frequency – we can take raw counts, or term frequency (\(tf\)). Proportions, term frequency divided by total token count in a given document, are another means. Certainly proportions make it easier to compare across corpora of different size. The problem with this is that they tend to get flooded with stopwords. One means of dealing with this is to remove the stopwords as we have done, but another is to attempt to downweigh words that appear often everywhere and upweigh those that are more unusual. Inverse document frequency (\(idf\)) is a weighting system to do this – it equals the total number of documents in the corpus divided by the number of documents in the corpus that contain the given word. The greater the number of documents in the corpus in which the word does not appear (suggesting words that are unique to certain documents rather than widespread across the corpus as a whole) the smaller the denominator and, thus, the greater the ratio.
TF-IDF is is term frequency times inverse document frequency. Both are often logged. In symbols,
\[\begin{equation} \text{TF}_{t,d} = \begin{cases} 1 + \text{log}_{10} \: \text{count(t,d)}, & \text{if count(t,d)} > 0 \\ 0,&\text{otherwise.}\\ \end{cases} \end{equation}\]
\[\begin{equation} \text{IDF}_{t} = \text{log}_{10} \: \bigg(\frac{N}{\text{df}_t}\bigg). \end{equation}\]
where t is the given term, d is a given document, df\(_t\) is the number of documents in the corpus containing term t, and N is the total number of documents in the corpus. Long story short, tf-idf attempts to weight words by both frequency in an individual document and their unusualness over a corpus of documents. Every word in every document will have its own tf-idf (term frequency will vary across documents while inverse document frequency is the same across the corpus).
Let’s see how we do this in R. First we’ll compute document
frequency. In order to simplify results, lets use the same subsetting of
our data into pre-1945 and post-1945 – this means we’re treating
pre-WWII speeches as one single document and likewise post-1945
speeches. tidytext makes it pretty easy – simply unnest
tokens and then count the tokens. Note that count enable
counting within groups, which we passed to count telling it
to do the counts within the groups denoted in column
Period. This produces the same result as passing
group_by(Period) in the previous line and eliminating
Period from the count() call.
nobel <- read_rds("data/nobel_cleaned.Rds") %>%
mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII"))
nobel_words <- nobel %>%
unnest_tokens(words, AwardSpeech) %>%
count(words, Period, sort = TRUE)
We can then use bind_tf_idf() from tidytext. (Tidytext
implements tf-idf using proportional, but not logged, tf – we’ll see
some of these other versions in other packages). The function takes a
first argument (other than the tidy dataframe) that is the word, a
second that is the document, and third a column containing document-term
counts).
tf_idf <- nobel_words %>%
bind_tf_idf(words, Period, n)
tf_idf
## # A tibble: 17,036 × 6
## words Period n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 the Post-WWII 13947 0.0742 0 0
## 2 of Post-WWII 7281 0.0387 0 0
## 3 in Post-WWII 5614 0.0299 0 0
## 4 to Post-WWII 5591 0.0297 0 0
## 5 and Post-WWII 5417 0.0288 0 0
## 6 a Post-WWII 3712 0.0197 0 0
## 7 the Pre-WWII 3526 0.0761 0 0
## 8 that Post-WWII 2543 0.0135 0 0
## 9 is Post-WWII 2329 0.0124 0 0
## 10 of Pre-WWII 2191 0.0473 0 0
## # … with 17,026 more rows
## # ℹ Use `print(n = ...)` to see more rows
Here we see that tf-idf has zeroed out these extremely common and not very interesting terms, precisely what we’d hope an indicator like this would do. Lets see the highest tf-idf scores.
tf_idf %>% arrange(desc(tf_idf))
## # A tibble: 17,036 × 6
## words Period n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 nuclear Post-WWII 278 0.00148 0.693 0.00102
## 2 saavedra Pre-WWII 22 0.000475 0.693 0.000329
## 3 locarno Pre-WWII 20 0.000432 0.693 0.000299
## 4 angell Pre-WWII 19 0.000410 0.693 0.000284
## 5 pan Pre-WWII 19 0.000410 0.693 0.000284
## 6 ladies Post-WWII 71 0.000378 0.693 0.000262
## 7 global Post-WWII 60 0.000319 0.693 0.000221
## 8 non Post-WWII 54 0.000287 0.693 0.000199
## 9 poverty Post-WWII 52 0.000277 0.693 0.000192
## 10 persons Post-WWII 51 0.000271 0.693 0.000188
## # … with 17,026 more rows
## # ℹ Use `print(n = ...)` to see more rows
tf_idf %>%
mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>%
group_by(Period) %>%
slice_max(tf_idf, n = 15) %>%
ungroup() %>%
ggplot(aes(tf_idf, fct_reorder(words, tf_idf), fill = Period)) +
geom_col(show.legend = FALSE) +
facet_wrap(~Period, ncol = 2, scales = "free") +
labs(x = "tf-idf", y = NULL)
Many of these are names, logically enough as specific names of laureates appear mostly only when they are being awarded a prize thus increasing their idf and upweighing their tf-idf. To make this more useful we’d want to go through and week out (remove, just like stopwords) names. But still we see nuclear coming to the fore in the post-war era, reasonably enough! Things like “global”, “poverty” stand out post-war while pre-war we see “reparations”, “commercial”, more Europe specific vocabulary. A different conversation pre- and post-war.
We have now calculated lists of words appearing with top frequency and also seen how to calculate and graph words or phrases of interest across documents. But what if we were interested in seeing highest frequency words within a larger set of words? If were were interested in a limited set of, say, geographical names or people we could just make the list, do a join to remove all other words and calculate top frequencies among this subset of words. But what if we were interested in, say, highest occurring verbs?
For this and many other reasons linguists have long been interested in writing algorithms to automatically identify parts of speech, called part-of-speech tagging or POS. A POS tagger would take a sentence such as “The brown fox jumped over the brown log” and return “The_(article) brown_(adjective) fox_(noun) over_(preposition)” and so on.
One way we could imagine going about this is with a dictionary that identified nouns, verbs, etc. but this is clearly naive – many a word can be both noun and verb and much will depend on context. So we’ll use the openNLP package and write this function5
library(NLP)
library(tm) # load before openNLP
library(openNLP)
library(openNLPmodels.en)
POStagger <- function(text, filter = NULL){
# defining annotators
sent_token_annotator <- Maxent_Sent_Token_Annotator()
word_token_annotator <- Maxent_Word_Token_Annotator()
pos_tag_annotator <- Maxent_POS_Tag_Annotator()
# make sure input is a string
text <- as.String(text)
# defining a pipeline
annotate_text <- NLP::annotate(text, Annotator_Pipeline(
sent_token_annotator,
word_token_annotator,
pos_tag_annotator
))
# generate tags
tags <- sapply(subset(annotate_text, type=="word")$features, `[[`, "POS")
# get tokens
tokens <- text[subset(annotate_text, type=="word")]
# filter out just one pos type
if (!is.null(filter)) {
tagged <- tokens[tags %in% filter]
} else {
# put tokens and pos tags in df
tagged <- tibble(word=tokens, pos=tags) %>%
filter(!str_detect(pos, pattern='[[:punct:]]'))}
return(tagged)
}
nobel <- read_rds("data/nobel_cleaned.Rds")
POStagger(nobel$text[1])
This yields a nice dataframe with tokens in the first column and POS in the second column. The POS are not just “noun”, “verb” and so on but more finely differentiated. See here for a list of abbreviations. Using the same filter, group_by, etc methods we can now find lists of most frequently words by type of speech. There are, of course, other applications we might imagine when POS-tagging is used in conjunction with other topics discussed in the workshop.
You don’t really need to worry about what’s in there but that it takes a character string (preferably cleaned, as always, to limit the chance of unexpected errors). Note, too, that the function has the option of specifying a filter that will return a string of words of a certain POS.
Another commonly done sort of analysis that we can easily incorporate into our tidy workflow is that of sentiment analysis. There are numerous ways computational linguists have developed to algorithmically determine the “sentiment” of a piece of text. We focus here on a simple (and fairly naive) method that works on the level of individual words and employs a sentiment dictionary, essentially a list of words and their associated sentiments. As we will see, these same dictionary methods can also be applied to any list or lists of words and thus give analysts a more flexible tool to track vocabulary across a corpus.
tidytext includes three dictionaries, each working
slightly differently.6 Let’s take a look below.
library(tidytext)
# the first time you will need to say yes to download of the sentiment dictionary
get_sentiments("afinn")
## # A tibble: 2,477 × 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # … with 2,467 more rows
## # ℹ Use `print(n = ...)` to see more rows
tail(get_sentiments("afinn"))
## # A tibble: 6 × 2
## word value
## <chr> <dbl>
## 1 youthful 2
## 2 yucky -2
## 3 yummy 3
## 4 zealot -2
## 5 zealots -2
## 6 zealous 2
Here we have negative or positive sentiment (words beginning with
“ab-” seem to be quite negative!). If we want to get a quick overview of
the scale we can call either table() or
summary()
table(get_sentiments("afinn")$value) # for categorical data, tells us categories and n of those categories
##
## -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16 43 264 966 309 1 208 448 172 45 5
summary(get_sentiments("afinn")$value) # summary statistics for the value column
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.0000 -2.0000 -2.0000 -0.5894 2.0000 5.0000
So we see the value of sentiments goes from -5 to 5. Calling the other two:
get_sentiments("bing")
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # … with 6,776 more rows
## # ℹ Use `print(n = ...)` to see more rows
get_sentiments("nrc")
## # A tibble: 13,872 × 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # … with 13,862 more rows
## # ℹ Use `print(n = ...)` to see more rows
“Bing” labels words simply negative or positive and the “nrc” lexicon labels them according to a handful of main emotions. You can take a look at the tidytext documentation for more background on the lexicons.7
So what do we do with these? The simplest method is simply to count words from the lexicons that appear in a text and add them all up. This is what is known as a dictionary method, which we’ll talk about in a little more depth below. From a practical standpoint, how do we add up all dictionary words in a text? Tidy makes this pretty easy. You’ll see the code is a little bit different based on the kind of dictionary.
For the bing dictionary we will tokenize our corpus, do
an inner join8 with the sentiments dictionary which will
create a new row in our Nobel corpus dataframe where every row is the
associated sentiment with the corpus word and corpus words that do not
have a sentiment associated with them in the bing
dictionary will be eliminated from the dataframe. We then do a
pivot_wider, the opposite of pivot_longer
which we did in the previous session when graphing n-grams.
pivot_wider will make several columns out of fewer – here
taking the values of the sentiment column (“positive” and
“negative”) and making them column titles and populating the column
values with those from the associated rows of our n column.
Take a look at what the dataframe looks like before and after
transformation to make sure you understand what is happening. We do this
pivot in order to then be able to do basic arithmetic – take the number
of “positive” word frequencies and subtract the number of negative word
frequencies, and do this for every document in the corpus. This will be
our sentiment “score”. It is then a fairly straightforward path to
graphing it on ggplot.
library(tidyverse)
nobel <- read_rds("data/nobel_cleaned.Rds")
# calculating text sentiment by subtracting total positive sentiment words from total negative with bing lexicon
nobel %>%
unnest_tokens(word, AwardSpeech) %>% ## we call our new column "word" which makes inner_joins easier
inner_join(get_sentiments("bing")) %>%
count(sentiment, year = Year) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(aes(year, sentiment)) +
geom_line(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = 2, alpha = .8)
With the afinn corpus we have numeric “sentiment values”
attached to words so we need to sum all the values of all words in each
document.
# using the afinn lexicon
nobel %>%
unnest_tokens(word, AwardSpeech) %>% ## we call our new column "word" which makes inner_joins easier
inner_join(get_sentiments("afinn")) %>%
group_by(Year) %>%
summarize(sentiment = sum(value)) %>%
ggplot(aes(Year, sentiment)) +
geom_line(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = 2, alpha = .8)
We can also do this over different categories over years. For instance, say we’d like to do this same kind of chart but for each of our oil company’s sustainability reports.
sr <- srps <- read_csv("data/srps.csv")
## Rows: 3543 Columns: 3
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Text, Company
## dbl (1): Year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sr %>%
unnest_tokens(word, Text) %>% ## we call our new column "word" which makes inner_joins easier
inner_join(get_sentiments("afinn")) %>%
group_by(Company, Year) %>%
summarize(sentiment = sum(value)) %>%
ggplot(aes(Year, sentiment)) +
geom_line(show.legend = FALSE) +
geom_hline(yintercept = 0, linetype = 2, alpha = .8) +
facet_wrap(~Company, ncol = 2, scales = "free_x")
## Joining, by = "word"
## `summarise()` has grouped output by 'Company'. You can override using the `.groups` argument.
We can also use the sentiment lexicons in ways that mostly combine what we have already done in subsetting dataframes and counting words. Say we’d like to see which words in a certain category of documents are associated with negative anticipation in the pre-WWII vs post-WWII Nobel prize award speeches. We’ll do this in several steps.
First we get a dictionary of the words associated with anticipation.
Then do the same with negative sentiment words. Then we again do an
inner_join() that creates a new dataframe that includes
only those words present in both anticipation and
negative.
anticipation <- get_sentiments("nrc") %>%
filter(sentiment == "anticipation")
negative <- get_sentiments("nrc") %>%
filter(sentiment == "negative")
negAnt <- inner_join(anticipation, negative, by="word") %>%
select(word) # we don't really need the other two columns which tell us that they are negative and anticipation words
Now we make our Nobel data “tidy” (one-token-per-row) and do another
inner_join() with our negative anticipation words. What we
have left are all negative anticipation words in the corpus, which we
can then print.
nobel %>%
mutate(Period = ifelse(Year >= 1945, "Pre-WWII", "Post-WWII")) %>%
mutate(Period = factor(Period, levels = c("Pre-WWII", "Post-WWII"))) %>% # CHanges
unnest_tokens(word, AwardSpeech) %>%
inner_join(negAnt) %>%
group_by(Period) %>%
count(word, sort = TRUE) %>%
slice_max(order_by=n, n = 15) %>% # selecting to show only top 15 words within each group
ggplot(aes(reorder_within(x = word, by = n, within = Period), n, fill = Period)) + # reordering is a bit tricky, see ?reorder_within()
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~Period, ncol = 2, scales = "free") +
theme_bw() +
xlab('Words')
## Joining, by = "word"
Wait, “mother” is a word associated with both negativity and anticipation?
get_sentiments("nrc") %>%
filter(word == "mother")
## # A tibble: 6 × 2
## word sentiment
## <chr> <chr>
## 1 mother anticipation
## 2 mother joy
## 3 mother negative
## 4 mother positive
## 5 mother sadness
## 6 mother trust
Apparently so, according to this lexicon. Which offers us a nice segue into a larger discussion of dictionaries.
It is entirely obvious to historians that the meaning of words – not to mention latent sentiments – changes over time, place, and context within individual documents. So these methods have to be used with care. To take one instance, economists have noticed that words like “liability” have very different sentiment and latent connotations in finance and other areas of life (Loughran and Mcdonald (2011)). Thus it might behoove us to create our own dictionaries and this is something we could easily do. In this case, we (as domain experts) could compile our own dictionaries not only to judge sentiment, but to measure attention given to certain topics, tone of discussion,
Here really the only challenge is to read a dictionary into R and then use it per the above methods. Say that we have a list of words in a text file, not comma-separated, that is a dictionary for identifying when the theme of oil is brought up in a text (this is something I came up with in literally 5 seconds, please do not use this for anything – actual dictionaries should be considerably better thought out, as well as verified on real texts that it’s catching what you want it to). Each word is on its own separate line (coincidentally this is the format of the above cited Norwegian sentiment dictionary, so should you want to use that you can adopt the process here.)
oil_dict <- read_lines("data/oil_theme.txt")
oil_dict <- tibble(word = oil_dict, dictionary = "oil dictionary")
oil_dict %>%
filter(word != "")
## # A tibble: 6 × 2
## word dictionary
## <chr> <chr>
## 1 oil oil dictionary
## 2 petrol oil dictionary
## 3 petroleum oil dictionary
## 4 gas oil dictionary
## 5 tanker oil dictionary
## 6 fossil oil dictionary
In all of two lines of code we’ve made it into a tibble. We can see also we have an empty string as the last line. We take care of that and then we’re all set to use this dictionary. Now if we want to add up everytime a text uses a word in this dictionary we can easily do this.
In Google’s famous n-grams viewer, you can search not just for a single word but for phrases, sets of words “n” tokens long. Indeed, this is why it is called an “n”-gram, they are consecutive sequences of an arbitrary number of words. So far we’ve only been looking at individual words, so lets think about multi-word units.
True to form, the tidyverse will help out with this, allowing us to
unnest_tokens() by the n-gram by telling R that our token
of interest is now not the single word but an ngram of n length.
nobel %>%
unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2)
## # A tibble: 234,267 × 3
## Year Laureate twogram
## <dbl> <chr> <chr>
## 1 1905 Bertha von Suttner "on behalf"
## 2 1905 Bertha von Suttner "behalf of"
## 3 1905 Bertha von Suttner "of the"
## 4 1905 Bertha von Suttner "the nobel"
## 5 1905 Bertha von Suttner "nobel committee"
## 6 1905 Bertha von Suttner "committee bj\u00f8rnstjerne"
## 7 1905 Bertha von Suttner "bj\u00f8rnstjerne bj\u00f8rnson"
## 8 1905 Bertha von Suttner "bj\u00f8rnson introduced"
## 9 1905 Bertha von Suttner "introduced the"
## 10 1905 Bertha von Suttner "the speaker"
## # … with 234,257 more rows
## # ℹ Use `print(n = ...)` to see more rows
And we could then make a plot of the most frequently occurring
2-grams. The problem with this is that we’re going to be overrun with
stopwords, but now that we have bigrams how to do take the individual
stop words? We could search and remove with str_remove_all
without unnesting words, we could unnest, take out the stop words,
renest and then unnest by 2-grams, but we could also use a more tidy
approach and separate() which is another handy command for
manipulating tidyverse dataframes.
nobel %>%
unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2) %>%
separate(twogram, into=c("word1", "word2"), sep = " ") # here the call states what col to separate, into which columns, and where the separation should be made, here when there is a space between the words
## # A tibble: 234,267 × 4
## Year Laureate word1 word2
## <dbl> <chr> <chr> <chr>
## 1 1905 Bertha von Suttner "on" "behalf"
## 2 1905 Bertha von Suttner "behalf" "of"
## 3 1905 Bertha von Suttner "of" "the"
## 4 1905 Bertha von Suttner "the" "nobel"
## 5 1905 Bertha von Suttner "nobel" "committee"
## 6 1905 Bertha von Suttner "committee" "bj\u00f8rnstjerne"
## 7 1905 Bertha von Suttner "bj\u00f8rnstjerne" "bj\u00f8rnson"
## 8 1905 Bertha von Suttner "bj\u00f8rnson" "introduced"
## 9 1905 Bertha von Suttner "introduced" "the"
## 10 1905 Bertha von Suttner "the" "speaker"
## # … with 234,257 more rows
## # ℹ Use `print(n = ...)` to see more rows
Now we can do our stopword work. We could do an
anti_join on word1 and then word2 but the other nifty thing
we can do with tidy commands is use the %in% command (this
being the syntax of commands used in SQL databases some of which the
tidyverse lets you do).
nob <- nobel %>%
mutate(Period = ifelse(Year <= 1945, "Pre-WWII", "Post-WWII")) %>%
unnest_tokens(twogram, AwardSpeech, token = "ngrams", n = 2) %>%
separate(twogram, into=c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(twogram, word1, word2, sep = ' ') %>% # now putting word1 and word2 back into a single column called twogram
group_by(Period) %>%
count(twogram, sort=TRUE) %>%
slice_max(order_by=n, n = 10)
nob %>%
mutate(twogram = reorder(twogram, n)) %>%
ungroup() %>%
ggplot(aes(reorder_within(x = twogram, by = n, within = Period), n, fill = Period)) +
geom_col(show.legend = FALSE) +
scale_x_reordered() +
coord_flip() +
facet_wrap(~Period, ncol = 2, scales = "free") +
ylab('n') +
xlab("2-gram")
Words can also co-occur in the same context even if they are not necessarily right next to each other. To get at this we’ll need more than n-grams. One way to do this is the widyr package, which essentially restructures our tidy data does an operation and the recasts it back into a tidy format.9
For instance, we can use the pairwise_count() function
to find all the words that appear together in the same Nobel speech (not
necessarily right next to each other).
library(widyr)
nobel %>%
unnest_tokens(word, AwardSpeech) %>%
filter(!word %in% stop_words$word) %>%
pairwise_count(word, Year, sort = TRUE)
## # A tibble: 22,374,876 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 time peace 89
## 2 peace time 89
## 3 prize peace 86
## 4 peace prize 86
## 5 peace nobel 85
## 6 nobel peace 85
## 7 war peace 85
## 8 world peace 85
## 9 peace war 85
## 10 peace world 85
## # … with 22,374,866 more rows
## # ℹ Use `print(n = ...)` to see more rows
Notice that this is a much larger dataframe – our unnested dataframe minus stopwords was ~93.000 words, this new one is +21 million.
From here we might do many things using the same skills of
subsetting, counting, and plotting we have talked to up to now. We
filter for words co-occurring with “evil” and plot the most frequently
occurring in the same document as “evil”. We might also note that
unnest_tokens() will can also tokenize by sentence. This
means that if you wanted to look at co-occurrence within sentences you
could tokenize by sentence and give each sentence (which would occupy
one row each) an index number (something like
mutate(index = row_number()) might do the trick) and then
call pairwise_count.
We could also start talking about tokens and words being not just things we’d recognize as words but things like :) and so on.↩︎
We will also later talk about n-grams where we might take tokens to be 2-grams, sentences or even longer pieces of text.↩︎
Interestingly, one of the most famous cases of computational text analysis in the social science – Mosteller and Wallace’s attribution of anonymously published Federalist papers – analyzed stopwords and threw out everything else (Mosteller and Wallace (1963)).↩︎
This section and the next lean heavily on chapters 1 and 3 in Silge and Robinson (2017)↩︎
Based on examples from Clark (2018), Schweinberger (2021), and especially Niekler and Wiedemann (2020).↩︎
these are English-language, it should be said. There are many for other languages out there, see here for an extensive Norwegian positive/negative sentiment dictionary. Some of these will involve a little wrangling of the data to read into R and get them into tidy format.↩︎
For some discussion of the nrc lexicon and more background on sentiment analysis, see chapter 20 in Jurafsky and Martin (forthcoming). One of the earliest and best known sentiment dictionaries is the General Inquirer.↩︎
See chapter 13 in Wickham and Grolemund (2016) for great visualizations of joins and translation into R commands.↩︎
There are simple statistics to test the correlation
between words in the same document (phi-test – in widyr as
pairwise_corr), statistics to measure differences in word
use between documents (log-likelihood, chi-squared, keyness), and so on.
These are a bit much to cover in a two day workshop but area all fairly
easy to implement in R and you can easily find tutorials and explainers
that will help you do this.↩︎
2022.